Webscraping in R

Workshop FF UK 5.10.2023 3️⃣
Selected Chapters in Data Analysis

Renata Topinkova

LMU Munich
📫 renata.topinkova[at]lmu.de

Circling back…


🤔 Is there an API?


Isn’t there an R package for that?

📦 WHO, guardianapi, spotifyr, nytimes, wbstats, RedditExtractoR


Are you sure?

Google, GitHub


If you’re SURE sure… Generic package

📦 httr, httr2
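
When no dedicated package wraps the API, a request can be assembled by hand. Below is a minimal sketch with httr2. The endpoint `https://api.example.com/v1/search`, the query parameters `q` and `page`, and the user-agent string are all hypothetical, and nothing is sent over the network until `req_perform()` is called.

```r
library(httr2)

# Hypothetical endpoint and parameters, for illustration only.
req <- request("https://api.example.com") |>
  req_url_path_append("v1", "search") |>
  req_url_query(q = "webscraping", page = 1) |>
  req_user_agent("workshop-example (your.email@example.com)")

# Inspect the URL that would be requested; no network call happens here.
req$url

# With a real API you would then perform the request and parse the reply:
# resp <- req_perform(req)
# data <- resp_body_json(resp)
```

Building the request separately from performing it makes it easy to check the final URL before contacting the server.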

Webscraping

extracting and copying data from a web page into a structured format using a computer program

Often also referred to as “screenscraping” = scraping from a computer screen

Screenscraping

✅ Works without API

✅ Flexible

✅ User’s behavior simulation (Selenium)

❌ Often complicated & frustrating (CAPTCHA, JavaScript)

❌ Needle in a haystack

❌ Data wrangling

❌ Websites can change

❌ Legally a gray area

Depends on Terms of Service, robots.txt, country laws, purpose, and whether you contacted the platform ➡️ a lot of uncertainty and a gap between theory & practice. If unsure, seek professional advice.


If there is an API available and you don’t need to simulate users’ behavior ➡️ go for API

robots.txt

= Policy that specifies rules about automated data collection on the site

Can be accessed by typing websitename/robots.txt in your browser

E.g., https://www.ted.com/robots.txt
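
To make the idea concrete, here is a rough base-R sketch of reading the `Disallow` rules that apply to all crawlers. The robots.txt content is a hypothetical example inlined so the code runs offline, and `disallowed_for()` and `is_allowed()` are illustrative helper names, not functions from any package. The sketch deliberately ignores `Allow` lines, wildcards, and groups that list several user agents, so treat it as a reading aid rather than a compliance check.

```r
# Hypothetical robots.txt content; for a live site you could use
# readLines("https://www.ted.com/robots.txt") instead.
robots_txt <- c(
  "# hypothetical example",
  "User-agent: *",
  "Disallow: /search",
  "Disallow: /private/",
  "",
  "User-agent: BadBot",
  "Disallow: /"
)

# Collect the Disallow paths listed under a given User-agent line.
disallowed_for <- function(lines, agent = "*") {
  lines <- trimws(sub("#.*$", "", lines))            # strip comments
  lines <- lines[grepl(":", lines, fixed = TRUE)]    # keep "field: value" lines
  field <- tolower(trimws(sub(":.*$", "", lines)))
  value <- trimws(sub("^[^:]*:", "", lines))
  current <- ""
  rules <- character()
  for (i in seq_along(lines)) {
    if (field[i] == "user-agent") {
      current <- value[i]
    } else if (field[i] == "disallow" && current == agent) {
      rules <- c(rules, value[i])
    }
  }
  rules[nzchar(rules)]  # an empty Disallow means "nothing is disallowed"
}

# A path is treated as allowed if it starts with none of the disallowed prefixes.
is_allowed <- function(path, rules) !any(startsWith(path, rules))

rules <- disallowed_for(robots_txt)
is_allowed("/talks", rules)
is_allowed("/search?q=r", rules)
```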

Terms of Service

Example: https://www.researchgate.net/terms-of-service

In connection with using or accessing the Service, you shall not:

  • Impose an unreasonable or disproportionately large administrative burden on ResearchGate

  • Use any robot, spider, scraper, data mining tools, data gathering and extraction tools, or other automated means to access our Service for any purpose, except with the prior express permission of ResearchGate in writing

  • Employ any mechanisms, software, or scripts when using the Service

Scraping

Detour: Web architecture

Toy example

<html>
  <head>
    <title> Example website</title>
    <link rel = "stylesheet" href = "styles.css">
  </head>
  <body>
    <div class = "one">
      <h1> This is a heading </h1>
      <p class = "to-scrape" > Scrape me! </p>
    </div>
    <div>
      <p> Do not scrape me! </p>
    </div>
  </body>
</html>

Two main ways - CSS and XPath
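
As an illustration, both approaches can pick out the same paragraph from the toy example. This is a minimal sketch that passes the toy page to `read_html()` as a string; the selectors shown are only one of several ways to address the element.

```r
library(rvest)

# The toy page, passed to read_html() as a literal string.
page <- read_html('
<html>
  <head><title>Example website</title></head>
  <body>
    <div class="one">
      <h1>This is a heading</h1>
      <p class="to-scrape">Scrape me!</p>
    </div>
    <div>
      <p>Do not scrape me!</p>
    </div>
  </body>
</html>')

# CSS selector: a <p> element with class "to-scrape"
css_hit <- page |>
  html_elements(css = "p.to-scrape") |>
  html_text()

# XPath: the same element, addressed by tag name and class attribute
xpath_hit <- page |>
  html_elements(xpath = "//p[@class='to-scrape']") |>
  html_text()
```

CSS selectors are usually shorter to write; XPath is more verbose but can express conditions (for example, on text content or on parent elements) that CSS cannot.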

Locating elements on websites

  1. Go to any website
  2. Right click on the website, select Inspect (CZ: Prozkoumat)
    (Chrome shortcut Ctrl + Shift + I)
  3. Examine the structure

💡 You can also use SelectorGadget browser add-on to help you find relevant selectors

In R

library(rvest)

read_html("path") - reads in the entire website (STEP 1)

html_elements(x, css = "") or html_elements(x, xpath = "") finds all elements matching a CSS selector or an XPath expression

html_text() extracts text between tags

html_attr() extracts the value of an attribute (mostly href from <a href = "">, i.e., links on pages)

html_table() reads in tables
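
The functions above can be chained into a small pipeline. The sketch below uses `minimal_html()` with made-up content as an offline stand-in for `read_html("path")`; the class name `intro` and the example.com links are invented for illustration.

```r
library(rvest)

# minimal_html() wraps an HTML fragment in a full page; the content is made up.
page <- minimal_html('
  <p class="intro">Hello, <b>scraper</b>!</p>
  <a href="https://example.com/a">First link</a>
  <a href="https://example.com/b">Second link</a>
  <table>
    <tr><th>name</th><th>score</th></tr>
    <tr><td>Ann</td><td>1</td></tr>
    <tr><td>Bob</td><td>2</td></tr>
  </table>')

# Text between the tags of the matching paragraph
intro <- page |> html_elements("p.intro") |> html_text()

# Value of the href attribute of every link
links <- page |> html_elements("a") |> html_attr("href")

# Every <table> on the page, returned as a list of data frames
tables <- page |> html_elements("table") |> html_table()
```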


Beware of html_element(): it works like html_elements() but returns exactly one result per input (the first match, or a missing value if there is none)!
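
The difference matters most when some items lack a field. A small sketch with three made-up items, one of which has no author (the class names `item`, `text`, and `author` are invented for this example):

```r
library(rvest)

# Three made-up items; the second one has no author.
page <- minimal_html('
  <div class="item"><span class="text">A</span><small class="author">Ann</small></div>
  <div class="item"><span class="text">B</span></div>
  <div class="item"><span class="text">C</span><small class="author">Cy</small></div>')

items <- page |> html_elements("div.item")

# html_elements() collects every match, so the missing author
# silently shortens the result.
all_authors <- items |> html_elements("small.author") |> html_text()

# html_element() returns one result per item, with NA where nothing matched.
one_per_item <- items |> html_element("small.author") |> html_text()

length(all_authors)
length(one_per_item)
```

Selecting the repeating container first and then using `html_element()` inside it keeps the fields aligned row by row.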

Let’s try it together

Open the 03_1_Screenscraping_intro_exercise.qmd

Open the intro.html file.

Now, you try!

Open the `03_1_Screenscraping_intro_exercise.qmd`

  1. Go to quotes.toscrape.com
  2. Get the text of the quotes
  3. Get the authors of the quotes (should match the N of quotes)
  4. Bind them to a dataframe
  5. Bonus: Get links to authors’ pages
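
One possible shape for steps 2 to 5, sketched on an offline stand-in so it runs without a connection. The markup and the selectors (`div.quote`, `span.text`, `small.author`) are assumptions about the page structure and should be checked with Inspect or SelectorGadget; on the live site, start from `read_html("https://quotes.toscrape.com")` instead of `minimal_html()`.

```r
library(rvest)

# Offline stand-in for the page; markup and selectors are assumptions.
page <- minimal_html('
  <div class="quote">
    <span class="text">Quote one</span>
    <small class="author">Author A</small>
    <a href="/author/Author-A">(about)</a>
  </div>
  <div class="quote">
    <span class="text">Quote two</span>
    <small class="author">Author B</small>
    <a href="/author/Author-B">(about)</a>
  </div>')

# One node per quote, so every column below has one value per quote.
quotes <- page |> html_elements("div.quote")

quotes_df <- data.frame(
  text   = quotes |> html_element("span.text") |> html_text(),
  author = quotes |> html_element("small.author") |> html_text(),
  # Relative links are turned into full URLs by prepending the site address.
  author_url = paste0(
    "https://quotes.toscrape.com",
    quotes |> html_element("a") |> html_attr("href")
  )
)
```

Because each column is extracted per quote with `html_element()`, the number of authors matches the number of quotes by construction.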